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ABSTRACT 

A number of signal processing techniques have been 
recently proposed to locate periodicities in DNA se- 
quences. A key building block in any of these methods 
is the computation of the M-point short-time discrete 
Fourier transform X{n,k) at a particular frequency 
component k = M / R where R is an integer. In this 
paper and using multirate signal processing theory, 
we present a computationally efficient approach to 
find X{n,M / R). In specific, we first derive a new 
multirate DSP model that i) computes X{n,R) for 
any given integer R, and ii) is completely parameter- 
ized by a set of real digital filters Hr{z). We then 
obtain a closed-form expression for X{n,M/R) that 
clearly quantifies the computational efficiency of the 
new method, and finally discuss the design of Hr(z) 
to enhance the computation accuracy of A(n, M/ R). 

1. INTRODUCTION 

The application of signal processing techniques to an- 
alyze the structure of DNA sequences has been an ac- 
tive area of research in recent years. A fundamental 
component of such methods has been the deployment 
of the discrete Fourier transform (DFT) to identify 
periodicities within DNA sequences [1, 2, 3]. These 
periodicities are typically used to characterize impor- 
tant biological features. For example, the use of the 
magnitude squared of the DFT component at the fre- 
quency uj = 27r/3 has been proposed to discriminate 
between protein coding and non-coding regions in a 
DNA sequence [4, 5, 6, 7]. A measure that depends on 
the phase of the same DFT component has also been 
suggested for gene finding [8, 9]. Other DNA period- 
icities were found in [10] and in [11] using the DFT 
and the warped DFT (W-DFT) respectively. Most of 
these methods however suffer from a high computa- 
tional cost especially when searching over large DNA 
sequences in the order of thousands of nucleotides. 
The development of efficient computational tools for 
analyzing DNA sequences is a major requirement in 
genomic research. In this paper, we present a novel 



efficient algorithm that finds the short time DFT (ST- 
DFT) at any frequency 2tt/ R where R is an integer. 
The algorithm is based on a new DSP model that 
i) generalizes previous work (namely [6, 7]), ii) pro- 
vides a closed form expression for the ST-DFT at 
iv = 2Tr/R, and iii) reduces substantially the com- 
putational cost of this frequency component. The 
new scheme is also completely characterized by a set 
of real digital filters which, in turn, can be designed 
to enhance the accuracy of the frequency component. 
Although we mainly focus on DNA sequence analysis, 
our findings apply to many other DSP applications. 

2. SYMBOLIC TO NUMERIC MAPPING 

A single DNA strand is represented as a sequence of 
letters that belong to the alphabet F = {A, C, G, T}. 
The letters are the standard symbols given to the four 
nucleotides: Adenine A, Guanine G, Thymine T, and 
Cytosine C. Along the two strands of the DNA double 
helix, a pyrimidine in one chain always faces a purine 
in the other with a specific pairing of the bases: T 
with A and, C with G. For example, a typical section 
of a DNA double-strand is shown in Figure 1, where 
the 5' and 3' are labels for the two ends of the nu- 
cleotide. Mapping a DNA sequence to a set of digital 
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Figure 1 : DNA double helix structure 

signals is in general a common problem in the context 
of DNA data generation and several apparently dis- 
tinct and unrelated approaches exist. The simplest 
mapping solution is the Voss representation which 
forms four binary indicator sequences a;;(n),V I G F 
where a 1 would indicate the presence of a base and 
indicates its absence. For example, the mapping of 
the top single DNA strand in Figure 1 into the indi- 
cator sequence XA{n) generates {1,0,0,0,0,0,0,1,1}. 
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3. A GENERAL MULTIRATE DSP MODEL 

Assume that a discrete time sequence xi (n) has length 
N. The M-point ST-DFT of xi(n) is 



Af-l 



Xi{n, k) = ^ x;(n + m)e 



-j2TTmk/M 



(1) 



where n = 0, P, ..., {N — AI -\- I) and P is the amount 
of window shift. Given the four binary indicator se- 
quences XI (n), V / G F, we can therefore find the corre- 
sponding DFT sequences Xi(n, fc),V ^ G F. To deter- 
mine the value of the spectrum at a specific frequency 
uj = 21T / R where R is an integer, we evaluate Xi{n, k) 
at the index k = M / R. We assume here that the ST- 
DFT length M is a multiple of R, i.e. M = qR for 
some integer q. From (1), it follows that 



M 



M-l 



Xi{n) = Xi{n, — ) = ^ xi{n + m)e 



-j2TTm/R 



(2) 



factors. Then, the suggested model can be used to lo- 
cate the -^-frequency component iff R = 11^=1 -^^1"% 
where n^ G {0,1}, 11^=1 '^j ¥" Oj ^n^d Mi ^ 1,V i G 
{1, 2, . . . , B}. Second, we note that the complex fil- 
ter C'{z) has R—1 zeros on the unit circle, starting at 
u; = and spaced by -^ except at w = -^ as shown in 
Figure 5(b). The interpolated version oi H{z), H{z ), 
produces frequency images at u; = 0, %- • ■ ■ 27r-^^^ 
as given in Figure 6(f). By cascading the two filters, 
we can see from the above figures that C'{z) attenu- 
ates the images at the unwanted frequencies of H{z ) 
and leaves only one opening at the required frequency 
u! = ^. The resulting filter F{z) is therefore a com- 
plex band pass filter with center frequency uj = 27t/ R. 
Finally, by invoking the polyphase decomposition, we 
can derive the new model of Figure 3 where, in this 
case, H-r{z) = H{z) for all r = 0, 1, . . . , i? — 1. Li 
the more general setting illustrated in Figure 3, the 
sequences Xi^ (n) is given by 



The direct evaluation of (2) is expensive since it in- 
volves the multiplication of the sequence samples by 
the M complex exponentials {e^-'^'^™/^}. Further- 
more, these operations are repeated with every win- 
dow shift P . In the remainder of this paper, we show 
that such a computational cost can be in fact greatly 
reduced. To do this, \et P = R and note that equa- 
tion (2) can be interpreted as the convolution of a;;(n) 
with the complex digital filter f{n) 



fin) = 







< n < M - 1 
otherwise 



followed by decimation by R. Let F{z) = Z{f{n)} 
denote the Z transform of f{n). We can then show 
that F{z) can be expressed as a two-stage filter ob- 
tained by cascading the complex filter 

Ciz) = 1 + e^'lf z-i + • . . + e^'(«-i)*^-(«-i) 

with the real interpolated filter 

F(z^) = l + z-^ + ... + z-(^^-i) 

as illustrated in Figure 2 where the box labelled | R 
is a downsampler with decimation ratio R. A num- 
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Figure 2: Cascade implementation of F{z) 

ber of remarks are in order at this point. First, let 
p 

M = Y[i=i -^^ij where B is the number of A/'s prime 



Xir{n) = hr{n) * xir{n) 



(3) 



and is termed the r filtered polyphase component of 
Xi{n), defined V / G F, and V r G {0, 1, ..., R - 1}. 
From Figure 3, we can immediately observe that the 
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Figure 3: Multirate DSP model for -^-frequency com- 
ponent extraction 



new structure separates the real part oi F(z), namely 
H{z), from its complex part, namely C{z). More im- 
portantly, the complex part C{z) is fixed and only 
R <^ M complex computations are required in this 
case. Finally, we emphasize that the above scheme 
is a generalization of the DSP model derived in [7] 
for the specific case of i? = 3. From Figure 3, we ob- 
tain closed form expressions for the filtered polyphase 
components Xir(n) and subsequently for the DFT 
Xi[n) in terms of the polyphase components oixi{n). 
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4. MATHEMATICAL ANALYSIS 



5. COMPLEXITY COMPARISON 



We first split tlie discrete time sequence xi (n) into its 
R polyphase components with decimation ratio R to 
get xi{n + Rm), xi{n + Rm + 1) ..., xi{n + Rm -\- R — 
1). These R polyphase components are generated by 
passing the signal xi{n) through an i?-branch multi- 
rate blocking structure composed of a delay chain fol- 
lowed by downsamplers (J, R) as shown in Figure 3. 
The polyphase sequences of the windowed sequence 
XA{n) with R = 5 and M = 10 are given in Figure 4. 
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Figure 4: Polyphase sequences of a 10-sample window 
of Xa {n) with R = 5 



We now can re-express Xi{n) in (2) as follows 
Xi{n) 



y xi{n + Rm) 

m=0 

+ ^ xi{n + Rm + l)e^^^ + ... 

m=0 

+ Yl xi{n + Rm + R-l)e-^''^-'^^^ 

m=0 

R-1 Lf-iJ 
= ^ ^ xi{n + Rm + r)e^^''^{A) 

r=0 m=r,r-^R,... 

The inner summation in (4) represents Xir{n) and 



is simply the sum of the samples in the 



„th 



Ddon 



position of xi{n) or, in other words, the number of 
occurrences of nucleotide I in codon position r of the 
window at index n. We can therefore define 

Xlrin) = Yl x{n + Rm + r), (5) 

77i=r,r+i?,... 

and obtain the following closed form expression 
R-i 
Xi{n) ^ ^Xi,(n)e-J'2Wi? (6) 

Note that equation (5) is indeed the convolution op- 
eration given in (3) where, in this case, hr{n) is a 
rectangular window. From (6), Xi(n) is obtained by 
multiplying Xir{n) with the corresponding complex 
coefficients and adding the outputs as in Figure 3. 



To quantify the comparison between the direct DFT 
expression and the multirate model alternative, we 
compare the computation complexity in both cases. 
Equation (2) finds the frequency component at ^ us- 
ing {M} complex multiplications and {M — 1} com- 
plex additions for every nucleotide base I. Typical 
DNA sequences can be hundreds of thousands of bases 
which indicates that the direct computation of (2) is 
not a good alternative when analyzing huge databases. 
On the other hand, equation (5) can be evaluated us- 
ing i^L^^J — M real additions. Note that these ad- 
ditions are actually simple increments since we are 
adding up the elements of a binary sequence. In 
turn, equation (6) requires {R} complex multiplica- 
tions and {R} complex additions. Therefore, the pro- 
posed multirate DSP model finds Xi(n), V Z G F, us- 
ing {2R} real multiplications and {M + 2R} real ad- 
ditions, compared to {2A/} real multiplications and 
{2(A/ — 1)} real additions for the direct DFT imple- 
mentation as summarized in Table 1. Besides these 
quantitative results, the implementation cost can be 
further reduced knowing that typically R <^ M and 
that {Af } out of the {M + 2i?} additions required by 
the multirate analysis are actually simple increments. 



Model 


Multiplications 


Additions 


Multirate 


2R 


M + 2R 


Direct DFT 


2M 


2{M - 1) 



Table 1 : Real computations required by the multirate 
model versus the direct DFT implementation V / G F. 



6. CHOOSING THE DIGITAL FILTERS 

Consider again the proposed model of Figure 3. For 
simplicity, we confine our discussion to the case where 
Hr{z) = H{z), V r. The exposition of sections 3 
and 4 focussed on the special case where H{z) is a 
rectangular window. Figure 7(a) shows the magni- 
tude response \F{z)\ = \C{z)H{z^)\ when H{z) is 
the standard rectangular window and M = 500. Al- 
though F{z) is a complex bandpass filter with center 
frequency at a; = 27r/5, it displays a poor magnitude 
response. A sharper response can be obtained by us- 
ing a more general FIR filter H{z). General FIR win- 



dows H{z) = ao + aiz + 



+ "(f -1) z 



-(¥-1) 



can be therefore utilized to replace the rectangular 
case in the model of Figure 3. Table 2 compares the 
noise rejection performance of four standard windows. 
Obviously, the Blackman window enjoys a highly at- 
tenuated first side-lobe which is the main parameter 
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affecting tlie noise attenuation. Figure 7(b) stiows tlie 
magnitude response |-F(5;)| = \C{z)H(z^)\ wtien H{z) 
is a Biacknian window and M = 500. Tiie center fre- 
quency is at a; = 27r/5 but all harmonics are below 
—40 dB and the first side-lobe goes down to almost 
—60 dB. Hence, the filtered model will accurately pick 
the required frequency component. Finally, a poten- 
tial boost in performance can be obtained by biologi- 
cally optimizing H[z) or more generally, Hr{z) V r. 
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FIR window 


Ai/Ao 


At^„ 


20 logio<5 


Rectangular 


-13 


4n/{M+l) 


-21 


Bartlett 


-25 


8n/M 


-25 


Hamming 


-41 


Stt/M 


-53 


Blackman 


-57 


127r/M 


-74 



Table 2: Comparison of common windows: peak side- 
lobe amplitude Ai/Ao in dB, approximate main lobe 
width Awm, and peak approximation error 5 in dB. 
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Figure 5: Complex filter C{z): (a) Magnitude re- 
sponse |C(e^'^)|, (b) zero-pole plot, for R = 5. 
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Figure 6: (a),(b) Impulse responses h{n) , h{n)\^ r, 
(c),(d) magnitude responses |F(e-'''^)|, \H{e^'^^)\, and 
(e),(f) zero-pole plots of the real LP filters II{z), 
H{z^) respectively, for M = 25, R = 5. 
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Blackman window, for R = 5 and M = 500. 
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